Identification of novel putative alleles related to important agronomic traits of wheat using robust strategies in GWAS

Principal component analysis (PCA) is widely used in various genetics studies. In this study, the role of classical PCA (cPCA) and robust PCA (rPCA) was evaluated explicitly in genome-wide association studies (GWAS). We evaluated 294 wheat genotypes under well-watered and rain-fed, focusing on spike traits. First, we showed that some phenotypic and genotypic observations could be outliers based on cPCA and different rPCA algorithms (Proj, Grid, Hubert, and Locantore). Hubert’s method provided a better approach to identifying outliers, which helped to understand the nature of these samples. These outliers led to the deviation of the heritability of traits from the actual value. Then, we performed GWAS with 36,000 single nucleotide polymorphisms (SNPs) based on the traditional approach and two robust strategies. In the conventional approach and using the first three components of cPCA as population structure, 184 and 139 marker-trait associations (MTAs) were identified for five traits in well-watered and rain-fed environments, respectively. In the first robust strategy and when rPCA was used as population structure in GWAS, we observed that the Hubert and Grid methods identified new MTAs, especially for yield and spike weight on chromosomes 7A and 6B. In the second strategy, we followed the classical and robust principal component-based GWAS, where the first two PCs obtained from phenotypic variables were used instead of traits. In the recent strategy, despite the similarity between the methods, some new MTAs were identified that can be considered pleiotropic. Hubert's method provided a better linear combination of traits because it had the most MTAs in common with the traditional approach. Newly identified SNPs, including rs19833 (5B) and rs48316 (2B), were annotated with important genes with vital biological processes and molecular functions. The approaches presented in this study can reduce the misleading GWAS results caused by the adverse effect of outlier observations.

Bread wheat (Triticum aestivum L.) is vital to the world's agricultural economy. Understanding the basis of its genetic and phenotypic diversity is essential to increase grain yield. Such a goal requires statistical methods with the correct application 1 . In this regard, cPCA has been one of the most important, simplest, and most widely used statistical tools for breeders. PCA aims to reduce data complexity in several principal components (PC). This multivariate method, in addition to studies of phenotypic and molecular diversity 2,3 , has been used in confirming population structure 4,5 , genotype selection 6 , understanding the pattern of genotype-by-environment interactions 7,8 , and selection of traits for yield modeling 9 . Despite its widespread use, the results of this technique can be highly biased in population genetic research 10 and GWAS 11 , which have been cited as potential pitfalls 12 .
Most of the problems of PCA are due to the high sensitivity of this method to the presence of outliers in the data. Phenotypic and genotypic data are susceptible to outlier samples, regardless of the cause. Outlier observations have been observed in single-site and multi-environment trials 13,14 , DNA-seq 15,16 , and RNA-seq 17 data. Therefore, the presence of such observations in the data is inevitable and violates the underlying assumptions of many statistical analyses 18 . The problem of outlier observations does not end only with their adverse effects, but identifying these observations and managing them is a very challenging task 19,20 . So far, several statistical

Materials and methods
Plant materials and phenotyping. A total of 294 Iranian bread wheat genotypes, including 90 cultivars released during the last century and 204 landraces collected from different locations (Supplementary Table 1), were evaluated in a randomized complete block design under well-watered (normal irrigation) and rain-fed environments. The experiments were carried out in a research farm with coordinates 50.58 E and 35.56 N and an altitude of 1112.5 m above sea level in 2019-2020. Climate information is presented in Supplementary Table 2 according to the months of the experiment. Five spikes from each genotype were randomly selected, and different traits were measured after being transferred to Urmia University. Spike weight (SW), grain number per spike (GN), grain yield (GY), thousand kernel weight (TKW), and spike internode length (SIL) were recorded. Spike internode length was obtained from the ratio of spike axis length to the number of nodes per spike 44 . The broadsense heritability of these traits was calculated through the following equation: where δ 2 g , δ 2 p , and δ 2 ge are the genotypic, phenotypic, and genotype-by-environment interaction variances, respectively. δ 2 ε is the error variance, Also, n e and n r are the number of environments and replicates, respectively. Finally, H 2 BS was expressed as a percentage.
Genotyping. Iranian wheat landraces and cultivars were genotyped using the genotyping-by-sequencing (GBS) technique, the details of which have been described previously 45 . In this method, SNPs were called using the UNEAK GBS pipeline 46 , and imputation was performed in BEAGLE v3. 3.273 47 using the w7984 reference genome. For more details, refer to Alipour et al. 48 . Finally, SNPs with heterozygotes greater than 10% and minor allele frequency less than 5% were removed, and 36,000 SNPs remained for GWAS. The distribution of SNP markers in 21 chromosomes, kinship relationships, and genetic diversity indices have been reviewed in more detail in our previous reports 25,49 . Linkage disequilibrium (LD). Genome-wide linkage disequilibrium (LD) analysis was checked by calculating pairwise marker allele squared correlation coefficient (r 2 ) for all pairwise comparisons. The LD decay pattern for the whole genome and separately for A, B, and D genomes was displayed by plotting pairwise r 2 values against genetic distance (bp).
Classical and robust principal component analysis. We used classical PCA analysis and four robust PCA (Proj, Grid, Hubert, and Locantore) methods to evaluate the population structure and identify outliers. Multivariate detection of outlier observations includes examining each observation based on a combination of variables. Usually, researchers deal with more than two variables in their data, so it is necessary to determine outlier observations in terms of a combination of variables. PCA uses the distance of the sample scores as a criterion for investigating outlier observations. The greater the distance between the observed scores, the higher the possibility of outliers. Most rPCAs use the Projection-Pursuit (PP) principle. PP methods are suitable for analyzing ge n e + δ 2 ε n e n r www.nature.com/scientificreports/ data sets with many variables, and they aim to find structures in multivariate data by projecting them on a lowdimensional subspace. These methods maximize a robust measure of spread to obtain consecutive directions on which the data points are projected 50 . Here, the above four PCA methods are briefly described. The Grid method uses the PP technique to calculate PCA estimators, computed via a grid search algorithm in the plane rather than p-dimensional space 51 . This algorithm leads to higher amounts of explained variability. Hubert's method, called the ROBPCA algorithm, combines PP and robust covariance estimation, i.e., minimum covariance determinant (MCD) techniques, to compute the robust loadings 52 . This rPCA method is resistant to outliers in the data and uses the MCD method for low-dimensional data sets. It is also very suitable for high-dimensional data 53 . ROB-PCA yields more accurate estimates of noncontaminated datasets and more robust estimates of contaminated data. In addition, the Hubert model produces a diagnostic plot that displays and classifies outliers 52 . The spherical principal components procedure, called the Locantore method, is a functional data analysis method. This method performs cPCA on the data but is projected onto a unit sphere. The Locantore approach is very fast, and the estimates of the eigenvectors are consistent. This method is intended to explore the structure of populations of complex objects 54 . Finally, the Proj approach, which is as fast as the previous method, was used. This approach includes a simple algorithm for approximating the PP-estimators for PCA whose PCs can be sequentially computed. In other words, the Proj approach uses PP to calculate the robust eigenvalues and eigenvectors without going through robust covariance estimation 53,55 . The cPCA and rPCA analyses were performed using the rrcov package in the R program 56 . We used the outlier plots provided by these methods to identify outlier observations. Also, The first and second PCs were plotted to obtain a biplot.
Traditional single trait-GWAS. First, GWAS analysis was performed between SNP markers and average phenotypic data for each individual trait in each environment. Since the mixed linear model (MLM) is a standard method in GWAS for various wheat traits 57 and has better control over confounding effects 25 , we used this model. In the MLM model, two covariates are used to prevent false positives: (i) matrix k, which represents kinship relations or family relatedness, and (ii) the first three PCs of cPCA, which are considered as the population structure (stratification) in the model. Simply put, the equation of this model is as follows: where Y represents the studied trait, SNP provides genotypic information, PCs are the population structure, and the kinship represents the relationship between individuals in the population using genotypic information, and e is the residual error 57 . In the MLM model, individuals are considered random effects, and the relatedness among individuals is conveyed through a kinship matrix.
Single trait-GWAS with robust principal component covariates. As we mentioned earlier, classical PCA may not correctly represent the population structure. Therefore, in the second GWAS scenario, we used the four described rPCA methods as population structure. This work was done with the aim of obtaining different linear combinations of genotypic data to moderate the effect of genotypic outliers on GWAS results. Because rPCAs perform better in the presence of outlier observations.

Classical and robust principal component-based GWAS.
PCs have been used instead of phenotype variables in GWAS in recent years. Since GWAS for PCs obtained from phenotypic data can be more valuable than traditional single trait-GWAS, we used two PCs instead of the traits themselves in the third scenario. The first two PCs were used because they explained more than 90% of the phenotypic changes, and the remaining PCs did not provide specific information. We also used the PCs obtained from cPCA and four rPCA methods. This work was done to reduce the effect of outlier observations from phenotypic data. In all three scenarios above, the threshold of − log 10 (p) > 3 was used to state statistically significant markertrait associations (MTAs). Confidence intervals (CIs) for MTAs were calculated using the linkage disequilibrium (LD) decay. GWAS analyses were performed using genome association and prediction integrated tool (GAPIT) R-package 58 , but the necessary PCs were extracted through the rrcov package in the R program 56 . The quantile-quantile (Q-Q) plots of the observed and expected P values were plotted at − log10 (p) values to assess the adequacy of a fitted normal straight line to the markers. Finally, Venn diagrams were drawn using an online tool (https:// bioin forma tics. psb. ugent. be/ webto ols/ Venn/) and the t-test was performed between the alleles of each SNP to compare the results.

Gene ontology (GO).
The sequences surrounding all associated SNPs in the EnsemblPlants (http:// plants. ensem bl. org/ index. html) database were annotated using basic local alignment search tools (BLAST) and aligned with IWGSC v2.1 (International Wheat Genome Sequencing Consortium database) reference genome 59 . After aligning the SNP sequences with the reference genome, overlapping genes with the highest percentage identity (%ID) and the lowest expected value (E-val) were selected for further processing and interpretation. Then, the ontology of genes, i.e., molecular function, biological process, and their cellular component, was extracted from the ensemble-gramene database. It should be noted that markers with almost identical sequences were filtered. Finally, the gene network was analyzed using the genes identified in the database for annotation, visualization, and integrated discovery (DAVID) bioinformatics resources (https:// david. ncifc rf. gov/). This tool examines gene pathways based on the Kyoto Encyclopedia of Gene and Genomes (KEGG) enrichment analysis [60][61][62] (https:// www. genome. jp/ kegg/; www. kegg. jp/ kegg/ kegg1. html).

Results
Identification of outlier observations. Outlier samples were identified using cPCA and different rPCA methods based on phenotypic data in both environments and genotypic data (Table 1). These samples are determined based on the outlier plots provided by the rrcov R-package (outlier plots are not reported). Eight genotypes were identified as outlier samples in well-watered and rain-fed environments, and only genotype 182 was common among them. Based on genotypic data, 16 samples were outliers. Genotypes 4, 200, and 252 were outliers under well-watered using cPCA, and only genotype 252 was confirmed by other rPCA methods. However, in the rain-fed environment, all three genotypes that were identified as outliers by cPCA were also confirmed by rPCA methods. On the other hand, some outlier samples were identified only by robust methods. According to the results, Hubert's approach differs from other methods, especially in genotypic data. The Hubert method divides the outlier plot into four parts ( Supplementary Fig. 1). The normal samples are on the bottom left and are different from the other three outlier types because this category's score and orthogonal distances were low. The bottom right in the well-watered environment included genotypes No. 182, the rain-fed environment included genotypes No. 182 and 219, and the genotypic data included genotypes No. 19, 23, and 24, which had high score distance and low orthogonal distance. In the upper left space of the outlier plot, the genotypes in which the orthogonal distance is high but the score distance is low, are located. It is interesting to www.nature.com/scientificreports/ note that in the phenotypic data, we did not see an example in this area, but in the genotypic data, three genotypes (219, 262, and 283) were located there. The upper right will contain samples strongly deviated from normal samples and have high scores and orthogonal distances. We found such outlier examples only for phenotypic data. These results can help to understand the nature of outliers and determine their type. There was a significant difference between genotypes in both environments for all traits. Also, except for SIL, the mean of traits in the rain-fed environment decreased (Supplementary Table 3). Descriptive statistics showed little changes by removing two genotypes (105 and 252) that were outliers by most methods. These changes were different from one trait to another. In SW, the mean decreased, but the genotypic variance increased. This mode for GN was increasing and decreasing, respectively. As expected, SD and δ 2 GE decreased for all traits (except SIL). The amount of error variance also remained almost constant. The highest and lowest heritability were related to SIL and GY traits, respectively. Removing two outlier observations caused the heritability to increase by about 2% in some traits ( Table 2).
Linkage disequilibrium and population structure. The pattern of LD decay differed between subgenomes, ranging from 1707 to 5752 bp, so its level was high in the D genome and low in the B genome (Supplementary Fig. 2). The distribution of genotypes based on the first two PCs in all methods showed a clear distinction between Iranian wheat cultivars and landraces (Fig. 1). Some cultivars were among landraces. By looking closely at the pedigree of these cultivars, we found that some of them, such as Ohadi, Roshan, and Homa, are originally landraces that have experienced various selection processes. Although, in general, the distribution pattern of genotypes was similar in all five PCA methods, it seems that the Grid approach is different from others (Fig. 1).
Traditional single trait-GWAS. In GWAS analysis for five traits, 184 and 139 MTAs were identified in the well-watered and rain-fed environments, respectively. The details of these results are provided in Supplementary  Tables 4 and 5. Briefly, In the well-watered, 38 SNPs were associated with GN, about 32% of which were located on chromosome 7A, and 90% of the markers had a significant difference between their two alleles in terms of GN (Table 3). For the same trait in rain-fed, 23 MTAs were discovered, mainly distributed in the A genome as in well-watered, and 60% of them were different between their two alleles. For SW under well-watered and rainfed environments, respectively, we observed 19 MTAs (mostly in genome A) and 18 MTAs (mostly in genome B), of which 68% and 55% differed between their alleles in terms of average SW. Among the traits studied, most MTAs were assigned to SIL. 65 and 53 MTAs were associated with SIL in well-watered and rain-fed environments, respectively, and most of these SNPs were located on chromosomes 6B and 2B. After SIL, TKW, with 44 MTAs in a well-watered environment and 32 MTAs in a rain-fed environment, had the highest number of MTAs located in chromosome 2B. Mostly, we found a statistically significant difference between the two alleles of these markers in terms of average TKW. Finally, GY was significantly associated with 18 and 13 SNPs in well-watered and rain-fed, respectively, which were distributed in almost half of the 21 wheat chromosomes. Like the previous traits, the markers related to GY differed between their two alleles in terms of the average trait.  Supplementary Table 4. In addition, Venn plots are drawn in Fig. 2 to show the number of common SNPs between different methods. The effect of population structure on GWAS results varied from one trait to another. In both environments, the number of common SNPs between cPCA and rPCA methods was high for GN, SIL, and TKW traits but low for traits such as GY and SW. Using the PCs provided by Hubert and Grid methods as population structure identified some new markers, especially in SW, GY, and SIL traits. Also, several new markers were identified using a covariate of the population structure by the Proj method in most traits. In some cases, the new SNPs identified by the rPCA covariate were in the same chromosomal regions as those identified by the classical method. However, others were located Table 2. Mean, standard deviation (SD), variance components, and general heritability (H 2 BS ) for the studied traits in two environments. δ 2 G , δ 2 GE , and δ 2 E represent the genotypic, genotype-by-environment, and error variances, respectively $ In "no outlier", genotypes 105 and 252, which were outliers by most methods, have been removed from the total data. *** significant at 0.001 probability levels by F test of genotypic variance. We evaluated the fitness and efficiency of different models with Q-Q plots. The Q-Q plots of the expected − log10 (p) versus the observed − log10 (p) for PCA covariates were almost the same in both wellwatered (Fig. 3) and rain-fed (Fig. 4) environments. In general, in these methods, the observed values were close to the expected values. Only at the tail of the point distribution did we see a deviation from the expected value, indicating significant marker effects. However, minor differences are also visible; for example, the values observed in Hubert's method for GY in well-watered were closer to the expected values. Also, the p-values for GY under rain-fed seems slightly inflated, which is solved in Hubert's method.    www.nature.com/scientificreports/ We identified several markers associated with different traits, including two markers located in chromosomes 2A and 5A that were associated with GN, GY, and SW in a well-watered environment by all methods. Markers in chromosomes 4A, 2B, and 5B were associated with the above three traits in rain-fed. The highest number of identical markers in both environments was related to GY and SW. On the other hand, except for TKW, at least one marker was the same for the other four traits in both well-watered and rain-fed environments, which can be considered stable QTLs. As we expected, there was a significant difference between the two alleles of most of the identified SNPs in terms of the studied traits. Nevertheless, significant SNPs were high for traits such as SIL, TKW, and GN in well-watered and TKW in rain-fed. SNPs identified by Hubert and Proj methods had the highest percentage of significant markers for some traits (Table 3).   (Fig. 5A).  Table 5). In the classical method, the markers associated with PC1 and PC2 were mainly located in the A and B genomes, respectively. Unlike other methods with the most MTAs in chromosome 2B for PC2, in Grid and Proj methods, this chromosome contained more SNPs related to PC1. In Hubert's method, chromosomes 1A, 2A, 6A, and 6D had a high association with PC1, and chromosomes 2B and 6B with PC1. Almost the same results as Hubert's were obtained for two PCs from the Locantore method. Interestingly, 18 MTAs were identified for components of the Locantore method that were not present in the others. On the other hand, the SNPs associated with the Hubert method components wholly overlapped with those associated with a single trait (Fig. 5B).

Gene annotation and Kyoto encyclopedia of genes and genomes (KEGG) pathways.
Among all identified markers, 77 overlapped with a range of genes mainly located on chromosomes 2B, 3B, 6B, 1B, 7A, 4A, and 6A (Supplementary Table 6). Some of these genes had biological processes and molecular functions, forming part of cellular components such as membranes and nuclei. Defense response, protein phosphorylation, regulation of transcription, and DNA template are the most visible in biological processes. On the other hand, protein binding, ADP binding, DNA binding, iron ion binding, etc., were molecular functions of genes. The results of GO enrichment analysis and testing of statistically enriched pathways were performed in the KEGG, and some pathways, including biosynthesis of flavonoid, carotenoid, and secondary metabolites, were identified. Other pathways were metabolic, ubiquitin-mediated proteolysis, protein processing, plant-pathogen interaction, and fatty acid elongation (Table 4). Interestingly, some KEGG pathways were detected only by Hubert's method.

Discussion
PCA is a data dimensionality reduction method. In the analysis above, it is possible to convert many primary variables into a few new principal components, i.e., linear combinations of the original variables, that express the most variance observed in the original data. In recent years, PCA has also been used to help detect outliers.   19,21,63 . Outliers samples are observations that fall outside the general distribution pattern, and contamination of data with these samples is more of a rule than an exception 53 . Outliers can be caused by various factors such as low seed quality, outlying response, misidentification of genotype, wrong data imputation, etc. 20,64 . Looking more closely at the outliers, we found that genotype No. 252, which was an outlier under well-watered by all methods, had poor seed quality and shriveled because it had high GN and low TKW. Probably, genotype No. 105 had a wrong classification in well-watered because its parameters were similar and even higher than the average of well-watered. Such deviations can have a high impact on the genotypic average 65 .
In the present study, phenotypic data compared to genotypic data, different methods identify the same samples as outliers. This is probably due to the high dimensionality in the genotypic data. Determining outliers in highdimensional genomic data is difficult 17,20 . However, using different methods to discover them can be useful. In the outlier plot obtained from the ROBPCA algorithm (Hubert), it was found that no outlier sample had high scores and orthogonal distances based on genotypic data. If the number of accessions is high, the outlier status of some samples will change. Genotype No. 161 was present in a core set designed from among 2403 Iranian wheat landraces, and 12% of other genotypes were in outlier groups 16 . In analyzing the diversity of 80,000 wheat accessions worldwide, it was found that some are outliers. A closer examination indicated that some samples were misclassified in their passport information 66 . This is an example of the application of identifying outlier samples. Also, tools have been developed based on PCA analysis to detect outliers 67 to, e.g., determine the SNPs under selection, and involved in biological adaptation, and have been used in various crops 15,68,69 .
A good separation of cultivars from landraces based on PCA was expected because there are traces of foreign origin in the pedigree of many Iranian wheat cultivars. Some of the cultivars mixed with the landraces are old and the detailed information about their pedigree is not available, but they are most likely of landraces origin. In this regard, Khadka et al. 4 have discussed the possibility of mistakenly collecting exotic germplasm instead of landraces in Nepalese wheat during the 1970s to 1990s. According to our results, it has been reported that PCA based on SNPs can divide wheat accessions into separate groups according to breeding origin 32,70 . In our study, the distribution of landraces was not strictly based on their geographical origin where they were collected. The subgroups identified as population structure can represent different wheat breeding programs 33 . Therefore, using PCs as a population structure in GWAS can avoid false positive associations.
Efforts to increase wheat yields continue to meet human needs. In this regard, evaluating spike traits is critical due to their direct relationship with the grain. Several studies have emphasized the importance of spike traits [71][72][73] . As it was apparent, the traits studied under well-watered showed a decrease in average. SIL was an exception to this rule because the number of nodes decreased in parallel with the reduction in spike length. Heritability range varied from 50% in GY to 75% in SIL. The heritability estimated in this study for SIL is almost equal to that reported in a similar study 44 . The high heritability of this trait might be due to being mainly controlled by genetic effects. While for GY, which has complex control, heritability was low 25,74 . Our results demonstrate the distorting effect of outlier samples in estimating heritability, as previously mentioned 65 . Due to the incorrect estimation of heritability in the presence of outliers, it is suggested to use the robust estimation of heritability and to pay attention to these methods along with the classical approach 18 . In another study, the robust DF-REML framework for estimating variance components was proposed, which could provide a robust estimate of heritability 75 . However, many reports still do not pay attention to outlier data, which seems to be due to a lack of deep understanding of the distorting effects of this data.
We performed GWAS based on three approaches. Although these methods had a lot in common in highly significant MTAs, there was a difference between the methods in terms of significant SNPs at the level of − log 10 (p) > 3. Such changes in GWAS results are because genomic data are rarely of good quality and are contaminated with outliers, missing values, and noise 20 . Q-Q plot is a convenient tool that shows control of type I error in MTA detection and can be used for model selection in GWAS [76][77][78] . Population stratification or cryptic relatedness leads to systematic deviation from the diagonal at the upper-right end of the Q-Q plots 79 . Based on this plot, there was not much difference in the results. This result is probably why we did not have outlier observation in SNP data of the third type (both score and orthogonal distances high). However, it seems that using PCs obtained from Hubert's approach can be useful in data with outlier observations of the first and second type. Also, other rPCA methods need to be investigated in diverse populations because we saw slight differences in the deviation from the diagonal at the Q-Q plots of different methods. Finally, using cPCA in genotyping data that are not contaminated by outlier observations is better.
The distribution of GY-related markers in different chromosomes has been observed before 29,80 . 4A was the chromosome that contained MTAs for GY in both environments; even SNP rs12851 on this chromosome showed its significant association in both environmental conditions. The same result was reported in similar studies that observed a stable association between chromosome 4A and yield under drought stress 81,82 . In addition, SNP rs5823, associated with GN and PC1 in all methods, was located on chromosome 4A. Interestingly, it was present in the carotenoid biosynthesis pathway and supported the cellular anatomical entity. When we used Hubert's PC for population structure, chromosome 7A was most associated with GY. According to the annotation, SNP rs30826 on this chromosome played a role in the membrane structure and was present in the plant-pathogen interaction pathway. It has been proven that there are crosstalks between responses to biotic and abiotic stresses, and some stress-resistant genes can respond to biotic and abiotic stresses 83 . Also, the plant-pathogen interaction pathway was involved in the drought stress resistance of transgenic wheat 84 . In a robust GWAS analysis, to overcome the problems of outlier observations, the linear mixed model approach was strengthened using the β-divergence method. This method performed better than the linear regression and mixed model approaches in the presence of outlier data and identified new SNPs that can be used in breeding programs 11 . The combination of GWAS and t-tests help identify significant SNPs 85 , confirming CAPS markers 86 , and identifying favorable SNP alleles 87 . Hence, we tracked the t-test for a more accurate assessment of the changes in phenotypic data due www.nature.com/scientificreports/ to allelic variation. The Hubert and Proj method had the highest percentage of significant SNPs in some traits compared to the classical method. The non-significant SNPs between the two alleles in terms of the t-test were distributed in most traits in a range of chromosomes; however, 2B and 7A had the highest number of these SNPs for SIL and GN traits, respectively. Principal component-based GWAS is statistically more powerful than single-trait GWAS and requires less computational time than multi-trait GWAS. Therefore, it is an efficient method to identify pleiotropic markers 39 . If the first two PCs explain a high percentage of trait variation, they can be suitable enough for GWAS 38,40,41 . In the present study, these two PCs captured more than 90% of the variation. Chromosome 7A had a high percentage of MTAs related to PC1 in all methods. The mentioned chromosome is important due to its association with multiple agronomic traits 88 and contains pleiotropic QTL 89,90 . Chromosome 2B was among the other regions associated with PCs in both environments. Pleiotropic loci have been reported in chromosome 2B 91 . Interestingly, under well-watered, SNP rs7805 located on chromosome 3B was associated with PC1 in all methods, while it was not significantly associated with any trait. 3B is a chromosome that simultaneously controls several yield components 25,92 . Such QTLs controlling two or more traits lead to high genetic correlation in many traits. The presence of such QTLs in the present study is not far from expected because the correlation between spike traits is certain 27,44 .
Although PC-GWAS increases statistical power and has led to the identification of new SNPs in some GWAS studies 40,41,93,94 , the presence of even a single outlier has a negative effect on PCA results 95 . The impact of outlier samples on GWAS results has already been discussed. It has been stated that the number and identity of QTLs identified through GWAS are influenced by the curation and preparation of phenotypic data. If the phenotypic data is contaminated with outliers, the number of suspicious QTLs will increase, especially in loci with unbalanced allelic frequency 64 . Therefore, it seems necessary to identify the outliers and then enter the next steps for the accuracy and precision of GWAS. For this reason, we used rPCA to obtain other linear combinations of traits that moderate the effect of outliers as input for GWAS. As far as we searched, such an approach has not been investigated in any GWAS. Under well-watered, SNP rs19833 on chromosome 5B, associated with PC1 in Grid and Proj methods, was involved in protein processing in the endoplasmic reticulum. There is evidence of the consequences of protein processing in the endoplasmic reticulum on the final grain size of wheat 96 , and we know that spike traits strongly affect the final grain size. In addition, based on the above two methods, other markers were identified that play a role in essential candidate genes, such as protein binding, protein phosphorylation, and lipid metabolic process. We saw a similar situation in the well-watered environment where SNP rs48316 located in chromosome 2B, was not associated with any trait but was significantly associated with PC1 in different methods. According to KEGG results, this SNP overlapped with the TraesCS2B02G098500 gene and was present in pathways such as the phosphatidylinositol signaling system, inositol phosphate metabolism, and metabolic pathways. All three of these pathways play a role in tolerance to drought stress [97][98][99] .

Conclusion
When phenotypic and genotypic data are contaminated with outlier observations, PCA and different rPCA algorithms provided promising results in identifying them. We have shown that in such a situation, cPCA should be used cautiously in GWAS due to its sensitivity. Using robust strategies in GWAS, putative alleles for important agronomic traits in wheat were identified that were not found in conventional GWAS. Linear combinations of Hubert and Proj methods moderated the effect of phenotypic outliers to a great extent and effectively identified pleiotropic markers. Also, in GWAS results, the population structure provided by rPCA methods discovered several new QTLs associated with traits. Robust strategies in GWAS reduce the risk of missing interesting rare alleles. Finally, it is necessary to pay more attention to the effect of outlier samples and the above methods in future GWAS studies.

Data availability
The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.